DILEMMA-2: A Lemmatizer-Tagger For Medical Abstracts
Abstract
This paper reports on the development of DILEMMA-2, a lemmatizer-tagger for the sublanguage of medical abstracts. The program is an extension of DILEMMA-1, a lemmatizer-tagger for general English texts. The first section gives a brief outline of DILEMMA-1, with particular attention to the original concept of a default category which is linked with a categorial graph by means of a pointer system. The second section shows why DILEMMA-1 could not achieve a suitable score when lemmatizing medical abstracts, the main reason being its inability to recognize sublanguage-specific vocabulary. The next section describes the most important errors along with their solutions; these errors are categorized as gaps or wrong assignments. The former could be dealt with in either a suffix list or a gaps filler default. The latter mainly concerned wrongly assigned past participles and errors in noun, verb or adjective assignment. After implementation of the proposed solutions, the results of DILEMMA-1 and DILEMMA-2 are compared, showing that the results of DILEMMA-1 were improved substantially within a sublanguage context by using linguistic, i.e. sublanguage, knowledge, thus avoiding ad hoc remedies.

DILEMMA-2 was developed as part of a research contract for Elsevier Science Publishers (ESP), Amsterdam, The Netherlands. The development of DILEMMA-1 was carried out as part of contract research for Van Dale Lexicografie Publishers, Utrecht, The Netherlands.

In this paper we describe DILEMMA-2, a lemmatizer-tagger for medical abstracts, which is an updated version of DILEMMA-1, a lemmatizer-tagger for general texts. After a brief outline of DILEMMA-1 we describe the types of errors we found when running the general lemmatizer on medical abstracts. This is followed by some examples of the solutions we proposed and implemented in DILEMMA-2. Finally, the results of DILEMMA-1 and DILEMMA-2 are compared, showing that a sublanguage approach can lead to workable results in the development of real-world applications.
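The abstract says that gaps (words the general lemmatizer could not recognize) were handled with either a suffix list or a gaps filler default. The page does not give the actual suffix list, tagset or default, so the following is only a minimal Python sketch of that general idea. The names `SUFFIX_TAGS`, `DEFAULT_TAG` and `tag_unknown`, the example suffixes and the tags are all hypothetical and are not taken from DILEMMA-2.

```python
# Illustrative sketch only: the suffixes, tags and default below are
# assumptions made for this example, not the lists used in DILEMMA-2.
SUFFIX_TAGS = [
    ("ectomy", "NOUN"),
    ("itis", "NOUN"),
    ("osis", "NOUN"),
    ("genic", "ADJ"),
]

# Assumed fallback category for words that match no suffix.
DEFAULT_TAG = "NOUN"


def tag_unknown(word, lexicon):
    """Return (tag, source), trying the lexicon, then suffixes, then the default."""
    key = word.lower()
    if key in lexicon:
        return lexicon[key], "lexicon"
    # Try longer suffixes first so that a more specific ending wins.
    for suffix, tag in sorted(SUFFIX_TAGS, key=lambda pair: len(pair[0]), reverse=True):
        # Require at least one character before the suffix.
        if key.endswith(suffix) and len(key) > len(suffix):
            return tag, "suffix"
    return DEFAULT_TAG, "default"


if __name__ == "__main__":
    toy_lexicon = {"the": "DET", "patient": "NOUN"}  # hypothetical general lexicon
    for w in ["the", "Appendectomy", "carcinogenic", "xyz"]:
        print(w, tag_unknown(w, toy_lexicon))
```

In this sketch the returned source label makes it easy to see which words fell through to the default, which is roughly the kind of inspection needed to decide whether a new suffix entry is warranted.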
Similar resources
Different Issues in the Design of a Lemmatizer/Tagger for Basque
This paper presents relevant issues that have been considered in the design of a general purpose lemmatizer/tagger for Basque (EUSLEM). The lemmatizer/tagger is conceived as a basic tool necessary for other linguistic applications. It uses the lexical data base and the morphological analyzer previously developed and implemented. Due to the characteristics of the language, the tagset here propos...
A Freely Available Morphological Analyzer, Disambiguator and Context Sensitive Lemmatizer for German
In this paper we present Morphy, an integrated tool for German morphology, part-of-speech tagging and context-sensitive lemmatization. Its large lexicon of more than 320,000 word forms plus its ability to process German compound nouns guarantee a wide morphological coverage. Syntactic ambiguities can be resolved with a standard statistical part-of-speech tagger. By using the output of the tagger...
An Accurate Arabic Root-Based Lemmatizer for Information Retrieval Purposes
In spite of its robust syntax, semantic cohesion, and lower ambiguity, lemma-level analysis and generation has not yet been a focus of the Arabic NLP literature. In the current research, we propose the first non-statistical accurate Arabic lemmatizer algorithm that is suitable for information retrieval (IR) systems. The proposed lemmatizer makes use of different Arabic language knowledge resources to g...
Transferring PoS-tagging and lemmatization tools from spoken to written Dutch corpus development
We describe a case study in the reuse and transfer of tools in language resource development, from a corpus of spoken Dutch to a corpus of written Dutch. Once tools for a particular language have been developed, it is logical, but not trivial, to reuse them for other types or registers of the language than the tools were originally designed for. This paper reviews the decisions and adap...
Publication date: 1992